{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyzing Locust Load Testing Results\n",
    "\n",
    "This Notebook demonstrates how to analyze **AI Platform Prediction** load testing runs using metrics captured in **Cloud Monitoring**. \n",
    "\n",
    "This Notebook build on the `02-perf-testing.ipynb` notebook that shows how to configure and run load tests against AI Platform Prediction using [Locust.io](https://locust.io/). The outlined testing process results in a Pandas dataframe that aggregates [the standard AI Platform Prediction metrics](https://cloud.google.com/monitoring/api/metrics_gcp#gcp-ml) with a set of custom, log-based metrics generated from log entries captured by the Locust testing script.\n",
    "\n",
    "\n",
    "The Notebook covers the following steps:\n",
    "1. Retrieve and consolidate test results from Cloud Monitoring\n",
    "2. Analyze and visualize utilization and latency results\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "This notebook was tested on **AI Platform Notebooks** using the standard TF 2.2 image."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from datetime import datetime\n",
    "\n",
    "from typing import List\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import google.auth\n",
    "\n",
    "from google.cloud import logging_v2\n",
    "from google.cloud.monitoring_dashboard_v1 import DashboardsServiceClient\n",
    "from google.cloud.logging_v2.services.metrics_service_v2 import MetricsServiceV2Client\n",
    "from google.cloud.monitoring_v3.query import Query\n",
    "from google.cloud.monitoring_v3 import MetricServiceClient\n",
    "\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure GCP environment settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT_ID = '[your-project-id]' # Set your project Id\n",
    "MODEL_NAME = 'resnet_classifier'\n",
    "MODEL_VERSION = 'v1'\n",
    "LOG_NAME = 'locust' # Set your log name\n",
    "TEST_ID = 'test-20200829-190943' # Set your test Id\n",
    "TEST_START_TIME = datetime.fromisoformat('2020-08-28T21:30:00-00:00') # Set your test start time\n",
    "TEST_END_TIME = datetime.fromisoformat('2020-08-29T22:00:00-00:00') # Set your test end time\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Retrieve and consolidate test results\n",
    "\n",
    "Locust's web interface along with a Cloud Monitoring dashboard provide a cursory view into performance of a tested AI Platform Prediction model version. A more thorough analysis can be performed by consolidating metrics collected during a test and using data analytics and visualization tools.\n",
    "\n",
    "In this section, you will retrieve the metrics captured in Cloud Monitoring and consolidate them into a single Pandas dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 List available AI Platform Prediction metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "creds , _ = google.auth.default()\n",
    "client = MetricServiceClient(credentials=creds)\n",
    "\n",
    "project_path = client.common_project_path(PROJECT_ID)\n",
    "filter = 'metric.type=starts_with(\"ml.googleapis.com/prediction\")'\n",
    "\n",
    "request = {'name': project_path, 'filter': filter}\n",
    "for descriptor in client.list_metric_descriptors(request):\n",
    "    print(descriptor.type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2. List custom log based metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filter = 'metric.type=starts_with(\"logging.googleapis.com/user\")'\n",
    "\n",
    "request = {'name': project_path, 'filter': filter}\n",
    "for descriptor in client.list_metric_descriptors(request):\n",
    "    print(descriptor.type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3. Retrieve test metrics\n",
    "\n",
    "Define a helper function that retrieves test metrics from Cloud Monitoring"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def retrieve_metrics(client, project_id, start_time, end_time, model, model_version, test_id, log_name):\n",
    "    \"\"\"\n",
    "    Retrieves test metrics from Cloud Monitoring.\n",
    "    \"\"\"\n",
    "    def _get_aipp_metric(metric_type: str, labels: List[str]=[], metric_name=None)-> pd.DataFrame:\n",
    "        \"\"\"\n",
    "        Retrieves a specified AIPP metric.\n",
    "        \"\"\"\n",
    "        query = Query(client, project_id, metric_type=metric_type)\n",
    "        query = query.select_interval(end_time, start_time)\n",
    "        query = query.select_resources(model_id=model)\n",
    "        query = query.select_resources(version_id=model_version)\n",
    "        \n",
    "        if metric_name:\n",
    "            labels = ['metric'] + labels \n",
    "        df = query.as_dataframe(labels=labels)\n",
    "        \n",
    "        if not df.empty:\n",
    "            if metric_name:\n",
    "                df.columns.set_levels([metric_name], level=0, inplace=True)\n",
    "            df = df.set_index(df.index.round('T'))\n",
    "        \n",
    "        return df\n",
    "    \n",
    "    def _get_locust_metric(metric_type: str, labels: List[str]=[], metric_name=None)-> pd.DataFrame:\n",
    "        \"\"\"\n",
    "        Retrieves a specified custom log-based metric.\n",
    "        \"\"\"\n",
    "        query = Query(client, project_id, metric_type=metric_type)\n",
    "        query = query.select_interval(end_time, start_time)\n",
    "        query = query.select_metrics(log=log_name)\n",
    "        query = query.select_metrics(test_id=test_id)\n",
    "        \n",
    "        if metric_name:\n",
    "            labels = ['metric'] + labels \n",
    "        df = query.as_dataframe(labels=labels)\n",
    "        \n",
    "        if not df.empty: \n",
    "            if metric_name:\n",
    "                df.columns.set_levels([metric_name], level=0, inplace=True)\n",
    "            df = df.apply(lambda row: [metric.mean for metric in row])\n",
    "            df = df.set_index(df.index.round('T'))\n",
    "        \n",
    "        return df\n",
    "    \n",
    "    # Retrieve GPU duty cycle\n",
    "    metric_type = 'ml.googleapis.com/prediction/online/accelerator/duty_cycle'\n",
    "    metric = _get_aipp_metric(metric_type, ['replica_id', 'signature'], 'duty_cycle')\n",
    "    df = metric\n",
    "\n",
    "    # Retrieve CPU utilization\n",
    "    metric_type = 'ml.googleapis.com/prediction/online/cpu/utilization'\n",
    "    metric = _get_aipp_metric(metric_type, ['replica_id', 'signature'], 'cpu_utilization')\n",
    "    if not metric.empty:\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve prediction count\n",
    "    metric_type = 'ml.googleapis.com/prediction/prediction_count'\n",
    "    metric = _get_aipp_metric(metric_type, ['replica_id', 'signature'], 'prediction_count')\n",
    "    if not metric.empty:\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve responses per second\n",
    "    metric_type = 'ml.googleapis.com/prediction/response_count'\n",
    "    metric = _get_aipp_metric(metric_type, ['replica_id', 'signature'], 'response_rate')\n",
    "    if not metric.empty:\n",
    "        metric = (metric/60).round(2)\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve backend latencies\n",
    "    metric_type = 'ml.googleapis.com/prediction/latencies'\n",
    "    metric = _get_aipp_metric(metric_type, ['latency_type', 'replica_id', 'signature'])\n",
    "    if not metric.empty:\n",
    "        metric = metric.apply(lambda row: [round(latency.mean/1000,1) for latency in row])\n",
    "        metric.columns.set_names(['metric', 'replica_id', 'signature'], inplace=True)\n",
    "        level_values = ['Latency: ' + value for value in metric.columns.get_level_values(level=0)]\n",
    "        metric.columns.set_levels(level_values, level=0, inplace=True)\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve Locust latency\n",
    "    metric_type = 'logging.googleapis.com/user/locust_latency'\n",
    "    metric = _get_locust_metric(metric_type, ['replica_id', 'signature'], 'Latency: client')\n",
    "    if not metric.empty:\n",
    "        metric = metric.round(2).replace([0], np.nan)\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve Locust user count\n",
    "    metric_type = 'logging.googleapis.com/user/locust_users'\n",
    "    metric = _get_locust_metric(metric_type, ['replica_id', 'signature'], 'User count')\n",
    "    if not metric.empty:\n",
    "        metric = metric.round()\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve Locust num_failures\n",
    "    metric_type = 'logging.googleapis.com/user/num_failures'\n",
    "    metric = _get_locust_metric(metric_type, ['replica_id', 'signature'], 'Num of failures')\n",
    "    if not metric.empty:\n",
    "        metric = metric.round()\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "    \n",
    "    # Retrieve Locust num_failures\n",
    "    metric_type = 'logging.googleapis.com/user/num_requests'\n",
    "    metric = _get_locust_metric(metric_type, ['replica_id', 'signature'], 'Num of requests')\n",
    "    if not metric.empty:\n",
    "        metric = metric.round()\n",
    "        df = df.merge(metric, how='outer', right_index=True, left_index=True)\n",
    "\n",
    "    return df\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_result = retrieve_metrics(\n",
    "    client, \n",
    "    PROJECT_ID, \n",
    "    TEST_START_TIME, \n",
    "    TEST_END_TIME, \n",
    "    MODEL_NAME, \n",
    "    MODEL_VERSION,\n",
    "    TEST_ID, \n",
    "    LOG_NAME\n",
    ")\n",
    "\n",
    "test_result.head().T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The retrieved dataframe uses [hierarchical indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) for column names. The reason is that some metrics contain multiple time series. For example, the GPU `duty_cycle` metric includes a time series of measures per each GPU used in the deployment (denoted as `replica_id`). The top level of the column index is a metric name. The second level is a `replica_id`. The third level is a `signature` of a model.\n",
    "\n",
    "All metrics are aligned on the same timeline. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Analyzing and Visualizing test results\n",
    "\n",
    "In the context of our scenario the key concern is GPU utilization at various levels of throughput and latency. The primary metric exposed by AI Platform Prediction to monitor GPU utilization is `duty cycle`. This metric captures an average fraction of time over the 60 second period during which the accelerator(s) were actively processing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1. GPU utilization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpu_utilization_results = test_result['duty_cycle']\n",
    "gpu_utilization_results.columns = gpu_utilization_results.columns.get_level_values(0)\n",
    "ax = gpu_utilization_results.plot(figsize=(14, 9), legend=True)\n",
    "ax.set_xlabel('Time', fontsize=16)\n",
    "ax.set_ylabel('Utilization ratio', fontsize=16)\n",
    "_ = ax.set_title(\"GPU Utilization\", fontsize=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2. CPU utilization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cpu_utilization_results = test_result['cpu_utilization']\n",
    "cpu_utilization_results.columns = cpu_utilization_results.columns.get_level_values(0)\n",
    "ax = cpu_utilization_results.plot(figsize=(14, 9), legend=True)\n",
    "ax.set_xlabel('Time', fontsize=16)\n",
    "ax.set_ylabel('Utilization ratio', fontsize=16)\n",
    "_ = ax.set_title(\"CPU Utilization\", fontsize=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3. Latency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "latency_results = test_result[\n",
    "    [x[0] for x in test_result.columns if x[0].startswith('Latency:')]]\n",
    "latency_results.columns = latency_results.columns.get_level_values(0)\n",
    "ax = latency_results.plot(figsize=(14, 9), legend=True)\n",
    "ax.set_xlabel('Time', fontsize=16)\n",
    "ax.set_ylabel('milisecond', fontsize=16)\n",
    "_ = ax.set_title(\"Latency\", fontsize=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4. Request throughput\n",
    " \n",
    "We are going to use the `response_rate` metric, which tracks a number of responses returned by AI Platform Prediction over a 1 minute interval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "throughput_results = test_result[['response_rate']]\n",
    "throughput_results.columns = throughput_results.columns.get_level_values(0)\n",
    "ax = throughput_results.plot(figsize=(14, 9), legend=True)\n",
    "ax.set_xlabel('Time', fontsize=16)\n",
    "ax.set_ylabel('Count', fontsize=16)\n",
    "_ = ax.set_title(\"Response Rate\", fontsize=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleaning up: delete the log-based metrics and dasboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logging_client = MetricsServiceV2Client(credentials=creds)\n",
    "parent = logging_client.common_project_path(PROJECT_ID)\n",
    "\n",
    "for element in logging_client.list_log_metrics({'parent': parent}):\n",
    "    metric_path = logging_client.log_metric_path(PROJECT_ID, element.name)\n",
    "    logging_client.delete_log_metric({'metric_name': metric_path})\n",
    "    print(\"Deleted metric: \", metric_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display_name = 'AI Platform Prediction and Locust'\n",
    "dashboard_service_client = DashboardsServiceClient(credentials=creds)\n",
    "parent = 'projects/{}'.format(PROJECT_ID)\n",
    "for dashboard in dashboard_service_client.list_dashboards({'parent': parent}):\n",
    "    if dashboard.display_name == display_name:\n",
    "        dashboard_service_client.delete_dashboard({'name': dashboard.name})\n",
    "        print(\"Deleted dashboard:\", dashboard.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## License\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "you may not use this file except in compliance with the License.\n",
    "You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-2-2-gpu.2-2.m50",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-2-2-gpu.2-2:m50"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
